{
 "cells": [
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## Python几个高级常用函数"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "介绍几个较为常用的高级函数：\n",
    "* Counter：计数器\n",
    "* defaultdict：带默认值的字典\n",
    "* map、reduce、filter：针对序列操作的函数\n",
    "* groupby：类似SQL中groupby的聚合函数（但只会相邻相同元素聚合）"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "### 1. Counter 计数器\n",
    "\n",
    "Counter（计数器）：用于追踪值的出现次数，Counter类继承dict类，所以它能使用dict类里面的方法 "
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "#### 对iterable进行计数"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 1,
   "metadata": {},
   "outputs": [],
   "source": [
    "import collections"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 2,
   "metadata": {},
   "outputs": [],
   "source": [
    "counter = collections.Counter(['a', 'b', 'c', 'a', 'b', 'a'])"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 3,
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/plain": [
       "Counter({'a': 3, 'b': 2, 'c': 1})"
      ]
     },
     "execution_count": 3,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "counter"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "#### 使用update可以往Counter新增内容"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 4,
   "metadata": {},
   "outputs": [],
   "source": [
    "counter.update(['c', 'b', 'c', 'd', 'c'])"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 5,
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/plain": [
       "Counter({'a': 3, 'b': 3, 'c': 4, 'd': 1})"
      ]
     },
     "execution_count": 5,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "counter"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "#### 可以像字典一样输出key/value"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 6,
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "a 3\n",
      "b 3\n",
      "c 4\n",
      "d 1\n"
     ]
    }
   ],
   "source": [
    "for key, value in counter.items():\n",
    "    print(key, value)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "#### 很方便的输出最高频率的数据"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 7,
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/plain": [
       "[('c', 4), ('a', 3), ('b', 3)]"
      ]
     },
     "execution_count": 7,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "counter.most_common(3)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "### 2. defaultdict\n",
    "\n",
    "直接访问字典不存在的key是会报错的，所以一般需要做初始化：\n",
    "```\n",
    "if key not in d:\n",
    "  d[key] = 0\n",
    "  d[key] = []\n",
    "d[key] += 3\n",
    "d[key].append(123)\n",
    "```\n",
    "dict =defaultdict(factory_function)  \n",
    "factory_function可以是str、int、list、set，可以省略初始化"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 8,
   "metadata": {},
   "outputs": [],
   "source": [
    "from collections import defaultdict"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "#### 默认的value为int类型，直接加数字"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 9,
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/plain": [
       "defaultdict(int, {'a': 3, 'b': 4})"
      ]
     },
     "execution_count": 9,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "dint = defaultdict(int)\n",
    "dint[\"a\"] += 3\n",
    "dint[\"b\"] += 4\n",
    "\n",
    "dint"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "#### 默认的value为list类型，直接添加元素"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 10,
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/plain": [
       "defaultdict(list, {'a': [1, 2, 3, 4]})"
      ]
     },
     "execution_count": 10,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "dlist = defaultdict(list)\n",
    "dlist[\"a\"].append(1)\n",
    "dlist[\"a\"].extend([2,3,4])\n",
    "\n",
    "dlist"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "### 3. map、reduce、filter函数\n"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "#### map(function, iterable, ...)\n",
    "给序列的每个元素应用一个函数，返回一个迭代器"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 11,
   "metadata": {},
   "outputs": [],
   "source": [
    "l = [1,2,3,4]"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 12,
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/plain": [
       "<map at 0x7f64a079a850>"
      ]
     },
     "execution_count": 12,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "result = map(lambda x: x**2, l)\n",
    "result"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 13,
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/plain": [
       "[1, 4, 9, 16]"
      ]
     },
     "execution_count": 13,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "list(result)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "#### reduce(function, iterable)\n",
    "\n",
    "使用function(x, y)函数，将序列缩减成1个元素结果"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 14,
   "metadata": {},
   "outputs": [],
   "source": [
    "l = [1,2,3,4]"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 15,
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/plain": [
       "10"
      ]
     },
     "execution_count": 15,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "from functools import reduce\n",
    "\n",
    "reduce(lambda x, y: x+y, l)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "#### filter(function, iterable)\n",
    "\n",
    "使用返回bool的function对序列过滤，返回满足条件的结果"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 16,
   "metadata": {},
   "outputs": [],
   "source": [
    "l = range(10)"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 17,
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/plain": [
       "<filter at 0x7f64a0788f90>"
      ]
     },
     "execution_count": 17,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "result = filter(lambda x: x%2==0, l)\n",
    "result"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 18,
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/plain": [
       "[0, 2, 4, 6, 8]"
      ]
     },
     "execution_count": 18,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "list(result)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "### 4. groupby函数\n",
    "\n",
    "按照指定key进行数据分组：groupby(iterable, key=None)"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 19,
   "metadata": {},
   "outputs": [],
   "source": [
    "# 文章ID、文章标题、文章分类\n",
    "datas = [\n",
    "    {\"id\":101, \"title\":\"标题1\", \"category\":\"Python\"},\n",
    "    {\"id\":102, \"title\":\"标题2\", \"category\":\"Java\"},\n",
    "    {\"id\":103, \"title\":\"标题3\", \"category\":\"C++\"},\n",
    "    {\"id\":104, \"title\":\"标题4\", \"category\":\"Python\"},\n",
    "    {\"id\":105, \"title\":\"标题5\", \"category\":\"Python\"},\n",
    "    {\"id\":106, \"title\":\"标题6\", \"category\":\"Java\"}\n",
    "]"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 20,
   "metadata": {},
   "outputs": [],
   "source": [
    "from itertools import groupby  "
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "#### 先将数据按目标key排序"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 21,
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/plain": [
       "[{'id': 103, 'title': '标题3', 'category': 'C++'},\n",
       " {'id': 102, 'title': '标题2', 'category': 'Java'},\n",
       " {'id': 106, 'title': '标题6', 'category': 'Java'},\n",
       " {'id': 101, 'title': '标题1', 'category': 'Python'},\n",
       " {'id': 104, 'title': '标题4', 'category': 'Python'},\n",
       " {'id': 105, 'title': '标题5', 'category': 'Python'}]"
      ]
     },
     "execution_count": 21,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "datas = sorted(datas, key=lambda x:x[\"category\"])\n",
    "datas"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "#### 使用groupby可以对相邻相同元素做聚合"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 22,
   "metadata": {},
   "outputs": [],
   "source": [
    "from itertools import groupby  "
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 23,
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/plain": [
       "<itertools.groupby at 0x7f64a0748c50>"
      ]
     },
     "execution_count": 23,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "result = groupby(datas, key=lambda x : x[\"category\"])\n",
    "result"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 24,
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/plain": [
       "[('C++', <itertools._grouper at 0x7f64a0747650>),\n",
       " ('Java', <itertools._grouper at 0x7f64a0747890>),\n",
       " ('Python', <itertools._grouper at 0x7f64a0747a90>)]"
      ]
     },
     "execution_count": 24,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "list(result)"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 25,
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "C++ [{'id': 103, 'title': '标题3', 'category': 'C++'}]\n",
      "Java [{'id': 102, 'title': '标题2', 'category': 'Java'}, {'id': 106, 'title': '标题6', 'category': 'Java'}]\n",
      "Python [{'id': 101, 'title': '标题1', 'category': 'Python'}, {'id': 104, 'title': '标题4', 'category': 'Python'}, {'id': 105, 'title': '标题5', 'category': 'Python'}]\n"
     ]
    }
   ],
   "source": [
    "# 因为是迭代器，消耗完了就没有数据了\n",
    "result = groupby(datas, key=lambda x : x[\"category\"])\n",
    "for key, values in result:\n",
    "    print(key, list(values))"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": []
  }
 ],
 "metadata": {
  "kernelspec": {
   "display_name": "Python 3",
   "language": "python",
   "name": "python3"
  },
  "language_info": {
   "codemirror_mode": {
    "name": "ipython",
    "version": 3
   },
   "file_extension": ".py",
   "mimetype": "text/x-python",
   "name": "python",
   "nbconvert_exporter": "python",
   "pygments_lexer": "ipython3",
   "version": "3.7.6"
  }
 },
 "nbformat": 4,
 "nbformat_minor": 4
}
